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Abstract. Particle learning (PL) provides state filtering, sequential pa- 
rameter learning and smoothing in a general class of state space models. 
Our approach extends existing particle methods by incorporating the 
estimation of static parameters via a fully-adapted filter that utilizes 
conditional sufficient statistics for parameters and/or states as parti- 
cles. State smoothing in the presence of parameter uncertainty is also 
solved as a by-product of PL. In a number of examples, we show that 
PL outperforms existing particle filtering alternatives and proves to be 
a competitor to MCMC. 

Key words and phrases: Mixture Kalman filter, parameter learning, 
particle learning, sequential inference, smoothing, state filtering, state 
space models. 



1. INTRODUCTION 

There are two statistical inference problems asso- 
ciated with state space models. The first is sequen- 
tial state filtering and parameter learning, which is 
characterized by the joint posterior distribution of 
parameters and states at each point in time. The sec- 
ond is state smoothing, which is characterized by the 
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distribution of the states, conditional on all available 
data, marginalizing out the unknown parameters. 

In linear Gaussian models, assuming knowledge 
about the system parameters, the Kalman filter (Kal- 
man, 1960) provides the standard analytical recur- 
sions for filtering and smoothing (West and Harri- 
son, 1997). For more general model specifications, 
conditional on parameters, it is common to use se- 
quential Monte Carlo methods known as particle fil- 
ters to approximate the sequence of filtering distri- 
butions (see Doucet, de Freitas and Gordon, 2001 
and Cappe, Godsill and Moulines, 2007). As for 
smoothing, the posterior for states is typically ap- 
proximated via Markov chain Monte Carlo (MCMC) 
methods as developed by Carlin, Poison and Staf- 
fer (1992), Carter and Kohn (1994) and Friihwirth- 
Schnatter (1994). 

In this paper we propose a new approach, called 
particle learning (PL), for approximating the se- 
quence of filtering and smoothing distributions in 
light of parameter uncertainty for a wide class of 
state space models. The central idea behind PL is 
the creation of a particle algorithm that directly 
samples from the particle approximation to the joint 
posterior distribution of states and conditional suffi- 
cient statistics for fixed parameters in a fully-adapted 
resample-propagate framework. 

In terms of models, we consider Gaussian Dy- 
namic Linear Models (DLMs) and conditionally Gaus- 
sian (CDLMs). In these class of models, PL is de- 
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fined over both state and parameter sufficient statis- 
tics. This is a generalization of the mixture Kalman 
filter (MKF) of Chen and Liu (2000) that allows 
for parameter learning. Additionally, we show that 
PL can handle nonlinearities in the state evolutions, 
dramatically widening the class of models that MKF 
particle methods apply to. Finally, we extend the 
smoothing results of Godsill, Doucet and West (2004) 
to sequential parameter learning and to all the mod- 
els considered. 

In a series of simulation studies, we provide sig- 
nificant empirical evidence that PL dominates the 
standard particle filtering alternatives in terms of 
estimation accuracy and that it can be seen as a 
true competitor to MCMC strategies. 

The paper starts in Section 2, with a brief review 
of the most popular particle filters that represent the 
building blocks for the development of PL in Sec- 
tion 3. Section 4 in entirely dedicated to the appli- 
cation of PL to CDLMs followed by possible exten- 
sions to nonlinear alternatives in Section 5. Section 
6 presents a series of experiments benchmarking the 
performance of PL and highlighting its advantages 
over currently used alternatives. 

2. PARTICLE FILTERING IN 
STATE SPACE MODELS 

Consider a general state space model defined by 
the observation and evolution equations: 

yt+i ~p{yt+i\xt+i,Q), 

x t+ i ~p(x t+1 \ Xt ,0), 

with initial state distribution p(xo\9) and prior p(9). 
In the above notation, states at time t are repre- 
sented by xt while the static parameters are de- 
noted by 9. The sequential state filtering and param- 
eter learning problem is solved by the sequence of 
joint posterior distributions, p(xt,9\y t ), where y l = 
(yi, . . . , yt) is the set of observations up to time t. 

Particle methods use a discrete representation of 
p(x t ,9\y t ) via 

1 N 

i=l 

where (xt,9)^ is the state and parameter parti- 
cle vector and <5(.) is the Dirac measure, represent- 
ing the distribution degenerate at the N particles. 
Given this approximation, the key problem is how 
to sample from this joint distribution sequentially as 



new data arrives. This step is complicated because 
the state's propagation depends on the parameters, 
and vice versa. To circumvent the codependence in 
a joint draw, it is common to use proposal distri- 
butions in a sequence of importance sampling steps. 
We now review the main approaches of this general 
sequential Monte Carlo strategy first for pure filter- 
ing and then with parameter learning. 

2.1 Pure Filtering Review 

We start by considering the pure filtering prob- 
lem, where it is assumed that the set of parameters 
9 is known. Although less relevant in many areas of 
application, this is the traditional engineering ap- 
plication where both the Kalman filter and original 
particle filters were developed. 

The bootstrap filter In what can be considered the 
seminal work in the particle filtering literature, Gor- 
don, Salmond and Smith (1993) developed a strat- 
egy based on a sequence of importance sampling 
steps where the proposal is defined by the prior for 
the states. This algorithm uses the following repre- 
sentation of the filtering density: 

p{x t +i \y t+1 ) oc p(y t +i \x t+1 )p(x t +i \y*), 
where the state predictive is 

p(x t+ i\y t ) = j pixt+ilx^pixtly 1 ) dx t . 

Starting with a particle approximation of p(xt\y t ), 
draws from p(xt+i\y t ) are obtained by propagat- 
ing the particles forward via the evolution equation 
p(xt+i\xt), leading to importance sampling weights 
that are proportional to like likelihood p{yt+\\xt+i). 
The bootstrap filter can be summarized by the fol- 
lowing: 

Bootstrap filter (BF). 

Step 1 (Propagate). {xf*}f =l to via 
p(x t+ i\x t ). 

Step 2 (Resample). {x[ l ^_ 1 }fL 1 from{x^ 1 }^ 1 with 
weights w'fli ocp(yt + i|4+i)- 

Resampling in the second stage is an optional step, 
as any quantity of interest could be computed more 
accurately by the use of the particles and its associ- 
ated weights. Resampling has been used as a way 
to avoid the decay in the particle approximation 
and we refer the reader to Liu and Chen (1998) for 
a careful discussion of its merits. Throughout our 
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work we describe all filters with a resampling step, 
as this is the central idea to our particle learning 
strategy introduced below. Notice, therefore, that 
we call BF a propagate-resample filter due to the 
order of operation of its steps: 

Auxiliary particle filter (APF). 

Step 1 (Resample). {x^}f =1 from {x^}fL 1 with 
weights 

wili (xp{y t+1 \g{xP)). 

Step 2 (Propagate), {xf ] }f =1 to {x^ii^i via 
p(x t+ i\x t ). 

Step 3 (Resample). with weights 

(i) p(yt+il4+i) 

w t+i a h\ — • 

p{y t+ i\g{x\ 1 ')) 

Auxiliary particle filter (APF) The APF of Pitt 
and Shephard (1999) uses a different representation 
of the joint filtering distribution of (xt,xt+\) as 

p{x t ,x t+1 \y t+1 ) 

oc p(x t+ i\xt,y t+1 )p(x t \y t+1 ) 

= p(x t+ i\x t , y t+1 )p(y t+ i \x t )p(x t 

Our view of the APF is as follows: starting with a 
particle approximation of p(xt\y t ), draws from the 
smoothed distribution of p(xt\y t+1 ) are obtained by 
resampling the particles with weights proportional 
to the predictive p{yt+\\x t ). These resampled par- 
ticles are then propagated forward via p{xt+i \xt,y t+1 ) 
The APF is therefore a resample-propagate filter. 
Using the terminology of Pitt and Shephard (1999), 
the above representation is an optimal, fully adapted 
strategy where exact samples from p N (xt+i\y t+l ) 
were obtained, avoiding an importance sampling step. 
This is possible if both the predictive and propaga- 
tion densities were available for evaluation and sam- 
pling. 

In general, this is not the case and Pitt and Shep- 
hard proposed the use of an importance function 
p(yt+i\jlt+i = g{xt)) for the resampling step based 
on a best guess for xt+i defined by fit+i = g{x t ). 
This could be, for example, the expected value, the 
median or mode of the state evolution. The resam- 
pled particles would then be propagated with a sec- 
ond proposal defined by p(xt+i\x t ), leading to the 
following algorithm: 

Two main ideas make the APF an attractive ap- 
proach: (i) the current observation yt+i is used in the 



proposal of the first resampling step and (ii) due to 
the pre-selection in step 1, only "good" particles are 
propagated forward. The importance of this second 
point will prove very relevant in the success of our 
proposed approach. 

2.2 Sequential Parameter Learning Review 

Sequential estimation of fixed parameters 9 is no- 
toriously difficult. Simply including 9 in the particle 
set is a natural but unsuccessful solution, as the ab- 
sence of a state evolution implies that we will be left 
with an ever-decreasing set of atoms in the particle 
approximation for p(9\y t ). Important developments 
in this direction appear in Liu and West (2001), 
Storvik (2002), Fearnhead (2002), Poison, Stroud 
and Miiller (2008), Johannes and Poison (2008) and 
Johannes, Poison and Yae (2008), to cite a few. We 
now review two popular alternatives to learn about 
9: 

Storvik 's filter. 

Step 1 (Propagate), {x^}^ to {x^^}^ via 

qixt^xf^^ 1 ). 

Step 2 (Resample). {(x t+ i, s t )^}f =1 torn {(x t+ i, 
s t)^}iLi with weights 

Wt+l0C q(xfll\4 } Av t+1 ) 

Step 3 (Propagate). Sufficient statistics Sj^ x = 

S{s[ l) ,x ( il 1 ,y t+1 ). 

Step 4 (Sample). flW from p{9\sf^_ 1 ). 

Storvik's filter Storvik (2002) (similar ideas ap- 
pear in Fearnhead, 2002) assumes that the poste- 
rior distribution of 9 given x t and y* depends on a 
low-dimensional set of sufficient statistics that can 
be recursively updated. This recursion for sufficient 
statistics is defined by st+i = S(st, xt+i, Ut+i), lead- 
ing to the above algorithm. Notice that the pro- 
posal q(-) is conditional on yt+i, but this is still a 
propagate-resample filter. 

Liu and West's filter Liu and West (2001) suggest 
a kernel approximation p(9\y t ) based on a mixture 
of multivariate normals. This idea is used in the con- 
text of the APF. Specifically, let {{x u 9 t )^}f =1 be 
particle draws from p(xt,9\y t ). Hence, the posterior 
for 9 can be approximated by the mixture distribu- 
tion 

N 

p(9\y t ) = Y, N ( mU) ^ 2 V t ), 
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where = a9\ j) + (1 - a)9 t , 9 t = £f =1 @t / N and 

Vt = Ef=M j) ~ 8t){0 [ t j) - e t )'/N. The constants 
a and h measure, respectively, the extent of the 
shrinkage and the degree of over dispersion of the 
mixture (see Liu and West, 2001 for a detailed dis- 
cussion of the choice of a and h). The idea is to use 
the mixture approximation to generate fresh sam- 
ples from the current posterior in an attempt to 
avoid particle decay. The algorithm is summarized 
in the next page. The main attraction of Liu and 
West's filter is its generality, as it can be imple- 
mented in any state-space model. It also takes ad- 
vantage of APF's resample-propagate framework and 
can be considered a benchmark in the current liter- 
ature: 

Liu and West's filter. 

Step 1 (Resample). {(xt, 6~t)^}f=i from {(xt, 
Ot^y-Lx with weights 

^j+i <xp(yt+i\g(x[ a) ),m (l) ). 

Step 2 (Propagate). 

(2.1) {9^ =1 to {flgj^ via N(jh®,V); 

(2.2) {xf}f =l to vi*p(x t+1 \xPMi)- 

Step 3 (Resample). {(x t+1 ,9 t+1 )^}fLi from 
{(x t+1 ,e t+ i) (i) }ti with weights 

(0 Kyt+iR+iA+i) 
w t+i oc u] — — • 

3. PARTICLE LEARNING AND SMOOTHING 

Our proposed approach for filtering and learning 
relies on two main insights: (i) conditional sufficient 
statistics are used to represent the posterior of 9. 
Whenever possible, sufficient statistics for the la- 
tent states are also introduced, increasing the ef- 
ficiency of our algorithm by reducing the variance 
of sampling weights in what can be called a Rao- 
Blackwellized filter, (ii) We use a resample-propagate 
framework and attempt to build perfectly adapted 
filters whenever possible in trying to obtain exact 
samples from our particle approximation when mov- 
ing from p N (xt, 9\y l ) to p N (x t +i, 9\y t+l ). This avoids 
sample importance re-sampling and the associated 
"decay" in the particle approximation. As with any 
particle method, there will be accumulation of Monte 
Carlo error and this has to be analyzed on a case- 
by-case basis. Simply stated, PL builds on the ideas 



of Johannes and Poison (2008) and creates a fully 
adapted extension of the APF to deal with parame- 
ter uncertainty. Without delays, PL can be summa- 
rized as follows, with details provided in the follow- 
ing sections: 

Particle learning. 

Step 1 (Resample). {z\ ; }£Li from zf 1 = (xt,st, 
#)W with weights wt (xp{yt+\\z^). 

Step 2 (Propagate). x[ to xfh via p(x t +i\z^\ 
Vt+i)- 

Step 3 (Propagate). Sufficient statistics = 

Step 4 (Sample). 0W from p(0|sgi)- 

Due to our initial resampling of states and suffi- 
cient statistics, we would end up with a more repre- 
sentative set of propagated sufficient statistics when 
sampling parameters than Storvik's filter. 

3.1 Discussion 

Assume that at time t, after observing y l , we have 
a particle approximation p N (zt \y l ), given by {z^}fL 1 
Once yt+i is observed, PL updates the above ap- 
proximation using the following resample-propagate 
rule: 

(3.1) p(z t \y t+1 ) (xp(y t+ i\zt)p(z t \y t ) 
and 

p{z t+ i\y t+1 ) = j p(s t+1 \x t+1 ,s t ,yt + i) 

(3.2) ■ p(x t+ i\zt,yt+i) 

■p(z t \y t+1 )dx t+1 dz t . 

From (3.1), we see that an updated approximation 
p N (zt\y t+1 ) can be obtained by resampling the cur- 
rent particles set with weights proportional to the 
predictive p(yt+i \zt). This updated approximation 
is used in (3.2) to generate propagated samples from 
the posterior p{x t -\.\ \zt, yt+i) that are then used to 
update st+i, deterministically, by the recursive map 
£(•), which in (3.2) we denote by p(s t +i\x t +i, s t ,y t+ i) 
However, since St and xt+i are random variables, the 
conditional sufficient statistics Sf+i are also random 
and are replenished, essentially as a state, in the fil- 
tering step. This is the key insight for handling the 
learning of 9. The particles for st+i are sequentially 
updated with resampled st particles and propagated 
and replenished xt+i particles and updated samples 
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from p(9\st+i) can be obtained at the end of the 
filtering step. 

By resampling first we reduce the compounding of 
approximation errors as the states are propagated 
after being "informed" by yt+i, as in APF. To clar- 
ify the notion of full-adaptation, we can rewrite the 
problem of updating the particles {z^}f =l to 
{ z tli\f=i as the problem of obtaining samples from 
the target p(xt+i, zt\y t+l ) based on draws from the 
proposal p(zt\y t+1 )p(xt+i\zt,y t+1 ), yielding impor- 
tance weights 



p(x t+ i,z t \y 



1, 



(3 ' 3) Wt+l0C p(z t \y^)p(x t+1 \z u y t+1 ) 
and therefore, exact draws. Sampling from the pro- 
posal is done in two steps: first draws z\ from 
p{zt\y t+l ) are simply obtained by resampling the 
particles {z^}fL 1 with weights proportional to 
p(yt+i\zt); we can then sample xf\ from 



p(xt+i\zt, y t+1 )- Finally, updated samples for st+i 
are obtained as a function of the samples of xt+ij 
with weights 1/N, which prevents particle degen- 
eracies in the estimation of 9. This is a feature of 
the "resample-propagate" mechanism of PL. Any 
propagate-resample strategy will lead to decay in 
the particles of xt+i with significant negative effects 
on p N (8\st+i). This strategy will only be possible 
whenever both p(y t+ i\z t ) and p(xt+i\zt,y t+1 ) are an- 
alytically tractable, which is the case in the classes 
of models considered here. 

Convergence properties of the algorithm are 
straightforward to establish. The choice of particle 
size iV to achieve a desired level of accuracy depends, 
however, on the speed of Monte Carlo accumulation 
error. In some cases this will be uniformly bounded. 
In others, a detailed simulation experiment has to 
be performed. The error will depend on a number 
of factors. First, the usual signal-to-noise ratio with 
the smaller the value leads to larger accumulation. 
Section 4 provides detailed simulation evidence for 
the models in question. Second, a source of Monte 
Carlo error can appear from using a particle ap- 
proximation to the initial state and parameter dis- 
tribution. This error is common to all particle meth- 
ods. At its simplest level our algorithm only requires 
samples 9^' from the prior p(9). However, a natural 
class of priors for diffuse situations are mixtures of 
the form p{9) = f p{9\zo)p{zo) dzo, with the condi- 
tional p(9\zo) chosen to be conditionally conjugate. 
This extra level of analytical tractability can lead to 



substantial improvements in the initial Monte Carlo 

error. Particles Zq^ are drawn from p(zq) and then 
resampled from the predictive and then propagated. 
Mixtures of this form are very flexible and allow for 
a range of nonconjugate priors. We now turn to spe- 
cific examples. 

Example 1 (First order DLM). For illustration, 
consider first the simple first order dynamic linear 
model, also known as the local level model (West 
and Harrison, 1997), where 

(y t+ i\x t+1 ,9) ~ N(x t+1 ,a 2 ), 

(x t+ i\x t ,9) ~ N(x t ,T 2 ), 

with 9 = (a 2 ,r 2 ), x ~ N{m ,C ), a 2 ~ IG(a ,6 ) 
and T 2 ~ IG(co,do). The hyperparameters too, Co, 
a 0> &0) c o an d do are kept fixed and known. It is 
straightforward to show that 

(y t+ i\x t ,8) ~ N(x t ,a 2 + t 2 ) and 
(x t+ i \y t +i , x t , 9) ~ N(m , u 2 ), 

where p, t = Lo 2 (a~ 2 y t+ \ + r~ 2 x t ), u~ 
Also, for scales 



a~ 2 + r- 2 . 



(a 2 \y t+1 ,x t+1 )~IG(a t+1 ,b t+ i) 
(T 2 \y t+ \x t+1 )~IG(c t+1 ,d t+1 ) 



and 



where a t+ i = a t + 1/2, c t+ i = c t + 1/2, b t+1 = b t + 
0.5(y f+ i - x t+ i) 2 and d t+ i = d t + 0.5(x t +i - %t) 2 ■ 
Therefore, the vector of conditional sufficient statis- 
tics st+i is 5-dimensional and satisfies the following 
deterministic recursions: st + i = st + (y 2 +i, Vt+iXt+i, 
x 2 + i,x 2 ,xt+iXt). Finally, notice that, in both, 
p(yt+i\xt) and p(xt+i \xt, y t+1 ) are available for eval- 
uation and sampling, so that a fully adapted version 
of PL can be implemented. 

3.2 State Sufficient Statistics 

A more efficient approach, whenever possible, is to 
marginalize states and just track conditional state 
sufficient statistics. In the pure filtering case, Chen 
and Liu (2000) use a similar approach. Here we use 
the fact that 

P(xt\y t )= / P(^|sf)p(sf|y*)dsf. 



Thus, we are interested in the distribution p(sf \y l ). 
The filtering recursions are given by 



p(st+i\y 



p{s t+1 \s t ,x t+ i,yt+i) 
■p{s x t ,x t+ i\y t+1 )ds*dx t+ i. 
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We can decompose p(sf ,xt+i\y t+1 ) as proportional 
to 

P(Vt+i \st)p{x t+ i \sf , y t+ i)p(s?|y*), 

where we have an extra level of marginalization. In- 
stead of marginalizing xt, you now marginalize over 
sf and xt+i. For this to be effective, we need the 
following conditional posterior: 

p(x t+1 \sf,y t+ i) = J p(x t +i\xt,yt+i)p(xt\st)dx t . 

We can then proceed with the particle learning al- 
gorithm. Due to this Rao-Blackwellization step, the 
weights are flatter in the first stage, that is, p{yt+\ \sf) 
versus p(yt+i\xt) increasing the efficiency of the al- 
gorithm. 

Example 1 (Cont.). Recalling (x t \6) ~ N(m t , 
Ct), then it is straightforward to see that (yt+±\m t , 
C u 9)~N(m t ,C t + a 2 + T 2 ), so sf = (m t ,C t ). The 
recursions for the state sufficient statistics vector 
sf are the well-known Kalman recursions, that is, 
m t+1 = (l-A t+1 )mt+A t+1 y t+1 and C t+ i = A t+ ia 2 , 
where A t+1 = (C t + r 2 ) / (C t + r 2 + a 2 ) is the Kalman 
gain. 

3.3 Smoothing 

Smoothing, that is, estimating the states and pa- 
rameters conditional on all available information, is 
characterized by p(x T ,9\y T ), with T denoting the 
last observation. 

After one sequential pass through the data, our 
particle approximation computes samples from 
p N (xt,st\y t ) for all t <T. However, in many situ- 
ations, we are required to obtain full smoothing dis- 
tributions p{x T \y T ) which are typically carried out 
by a MCMC scheme. We now show that our filtering 
strategy provides a direct backward sequential pass 
to sample from the target smoothing distribution. 
To compute the marginal smoothing distribution, 
we write the joint posterior of (x T ,6) as 

T-l 

p(x T ,9\y T ) = ^p(x t \x t+1 ,9,y t )p(x T ,9\y T ). 
t=\ 

By Bayes' rule and conditional independence, we 
have 

p(x t \x t+ i, 9, y f ) oc p(x t+ i \x t , 9, y t )p(x t \9, y*). 

We can now derive a recursive backward sampling 
algorithm to jointly sample from p(x T ,6\y T ) by se- 
quentially sampling from filtered particles with 



weights proportional to p{x t +i\x t ,9,y t ). In detail, 
randomly choose, at time T, (xt,§t) from the par- 
ticle approximation p N (xx, ST\y T ) and sample 9 ~ 
p(9\st)- Then, for t = T — 1, . . . , 1, choose xt = x^ 
from the filtered particles {x[ l \i = 1, . . . , N} with 
weights wty +1 ccp(x t+1 \x[ l \d): 

Particle smoothing. 

Step 1 (Forward filtering). Sample {(x T , 9y i '}^L 1 
via particle learning. 

Step 2 (Backwards smoothing). For each pair (xt, 

#)W and t = T — 1, . . . , 1, resample x^ from {x^ 
with weights 

w^cxpixfUx?^). 

This algorithm is an extension of Godsill, Doucet 
and West (2004) to state space models where the 
fixed parameters are unknown. See also Briers, Doucet 
and Maskell (2010) for an alternative SMC smoother. 
Both SMC smoothers are 0(TN 2 ), so the compu- 
tational time to obtain draws from p(x T \y T ) is ex- 
pected to be much larger than the computational 
time to obtain draws from p{xt\y t ), for t = 1, . . . , T, 
from standard SMC filters. An 0(TN) smoothing 
algorithm has recently been introduced by Fearn- 
head, Wyncoll and Tawn (2008). 

Example 1 (Cont.). For t = T - 1, ... ,2, 1, it is 
easy to see that (xt \x t+ i, y T , 9) ~ N(at, D t r 2 ) and 
(x t \y T ,9) ~ N{mJ,Cf), where a t = (1 - D t )m t + 
D t x t+1 mf = (1-D t )m t + D t mj +1 , Cj = (1- D t )C t + 
D?C? +1 , and A = C t /(C t + r 2 ). Finally, = m T 
and Crf = Ct- 

3.4 Model Monitoring 

The output of PL can be used for sequential pre- 
dictive problems but is also key in the computation 
of Bayes factors for model assessment in state space 
models. Specifically, the marginal predictive for a 
given model A4 can be approximated via 

1 N 

p N (y t+1 \y t ,M) = -Y,P(yt+^ x t,9fKM). 

i=i 

This then allows the computation of a SMC approx- 
imation to the Bayes factor Bt+i or sequential likeli- 
hood ratios for competing models Mo and A4\ (see, 
e.g., West, 1986): 

= p(yi,...,y t +i\Mi) 
t+ v{yu ■ ■ ■ ,yt+i\M Q y 
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where p(yi y t+1 \Mi) = Iljt 1 P(Uj W X , Mi), tor 
either model. 

Model monitoring. 

Step 1. Compute the predictive using 

1 N 

p N (y t+1 \y t ) = -'£p(y t+1 \(x t ,9)®). 
i=l 

Step 2. Compute the marginal likelihood 

t+l 

p N (yu- ■ -,yt+i) = Y[p A '(vj+iW)- 

3=1 

An important advantage of PL over MCMC 
schemes is that it directly provides the filtered joint 
posteriors p(xt, 9\y l ) and, hence, p(yt+i|y*), whereas 
MCMC would have to be repeated T times to make 
that available. 

4. CONDITIONAL DYNAMIC LINEAR 
MODELS 

We now explicitly derive our PL algorithm in a 
class of conditional dynamic linear models which 
are an extension of the models considered in West 
and Harrison (1997). This consists of a vast class of 
models that embeds many of the commonly used dy- 
namic models. MCMC via Forward-filtering 
Backward-sampling (Carter and Kohn, 1994; 
Fruhwirth-Schnatter, 1994) or mixture Kalman fil- 
tering (MKF) (Chen and Liu, 2000) are the current 
methods of use for the estimation of these models. 
As an approach for filtering, PL has a number of 
advantages. First, our algorithm is more efficient, as 
it is a perfectly-adapted filter. Second, we extend 
MKF by including learning about fixed parameters 
and smoothing for states. 

The conditional DLM defined by the observation 
and evolution equations takes the form of a linear 
system conditional on an auxiliary state At+i, 

(y t+ i\x t+1 ,\ t+1 ,8) ~ N(F Xt+1 x t+ i,V Xt+1 ), 
(x t+ i\x t ,X t+ i,9) ~ N(G Xt+1 x t ,W Xt+1 ), 

with 9 containing F's, G's, V's and Ws. The 
marginal distribution of observation error and state 
shock distribution are any combination of normal, 
scale mixture of normals or discrete mixture of nor- 
mals depending on the specification of the distri- 
bution on the auxiliary state variable p(Xt+i\9), so 



that 

P(yt+i\x t +i,0) = J f N (y t+1 ; F Xt+1 x t+ i,V Xt+1 ) 

■p(X t+ i\0)dX t+1 . 

Extensions to hidden Markov specifications where 
At+i evolves according to p(Xt+i\Xt,9) are straight- 
forward and are discussed in Example 2 below. 

4.1 Particle Learning in CDLM 

In CDLMs the state filtering and parameter learn- 
ing problem is equivalent to a filtering problem for 
the joint distribution of their respective sufficient 
statistics. This is a direct result of the factorization 
of the full joint 

p(x t+1 ,9, X t+ i,s t+ i,sf +1 \y t+1 ) 

as a sequence of conditional distributions 

p(8 1 s t+1 )p(x t+ i 14+1, A t+ i)p(A t+ i , s t+1 , sf +1 \y t+1 ). 

Here the conditional sufficient statistics for states 
(sf) and parameters (st) satisfy deterministic up- 
dating rules 

(4.1) 4 +1 =/C(4',^,A m ,y m ), 

(4.2) s t +i=S(st,xt + i,Xt+i,yt+i), 

where /C(-) denotes the Kalman filter recursions and 
S(-) our recursive update of the sufficient statistics. 
More specifically, define = (mt,Ct) as Kalman fil- 
ter first and second moments at time t. Conditional 
on 9, we then have 

(x t+ i\sf +1 ,X t+ i,9,) ~ N(a t+ i,Rt + i), 

where a t+ i = G Xt+1 m t and R t+1 = G Xt+1 C t G' Xt+i + 

W Xt+1 . Updating state sufficient statistics (mt+i, Ct+i) 
is achieved by 

(4.3) mt+i = G Xt+1 m t + A t+ \(yt+\ - e t ), 

(4.4) C-\ = + F' Xt+i F Xt+1 V x -l^ 

with Kalman gain matrix At+i = Rt+iF\ t+1 Q^ +1 , 
et = F Xt+1 G Xt+1 m t , and Q t +i = F Xt+1 R t+1 F Xt+1 + 

We are now ready to define the PL scheme for the 
CDLMs. First, assume that the auxiliary state vari- 
able is discrete with At+i ~ p(Xt+i \X t , 9). We start, 
at time t, with a particle approximation for the 
joint posterior of (xt, Xt,st,sf,9\y ). Then we prop- 
agate to t + 1 by first resampling the current par- 
ticles with weights proportional to the predictive 
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p(yt+i\(0, sf )). This provides a particle approxima- 
tion to p(xt,6 , Xt, St, sf\y t+1 ) , the smoothing distri- 
bution. New states Xt+i and xt+i are then prop- 
agated through the conditional posterior distribu- 
tions p(X t+ i\Xt,9,yt+i) and p(x t+1 \X t+ i,x t ,9,y t+1 ). 
Finally, the conditional sufficient statistics are up- 
dated according to (4.1) and (4.2) and new samples 
for 9 are obtained from p(9\st+i). Notice that in the 
conditional dynamic linear models all the above den- 
sities are available for evaluation and sampling. For 
instance, the predictive is computed via 

p( yt+1 \(\ t , s ?,e)W) = J2p(yt+i\h+i,(ste)V) 

At+i 

■p(X t+ i\X t ,0), 
where the inner predictive distribution is given by 

p(yt+i \h+x, s^, 9) = J p(y t+ i\x t+ i,X t+ i,9) 

■p(x t+ i\st,9)dx t+1 . 

Starting with particle set {(xq,9, Xq, sq, Sq)W,z = 
1, . . . , N} at time t = 0, the above discussion can be 
summarized in the PL Algorithm 1. In the general 
case where the auxiliary state variable Xt is contin- 
uous, it might not be possible to integrate out At+i 
form the predictive in step 1. We extend the above 
scheme by adding to the current particle set a prop- 
agated particle Xt+i ~ p(Xt+\\(Xt, 9p i >) and define 
the PL Algorithm 2. 

Both algorithms can be combined with the back- 
ward propagation scheme of Section 3.3 to provide 
a full draw from the marginal posterior distribution 
for all the states given the data, namely, the smooth- 
ing distribution p(x\ , . . . , xt \ y T ) ■ 

Algorithm 1 (CDLM). 

Step 1 (Resample). from zf 1 = (X t ,sf,9)^ 
with weights 

Step 2 (Propagate). States 

X i ?l 1 ~p{X t+l \{Xt,~9f\yt + i), 

x?l 1 ~p( Xt+1 \(x t ,9)®,x¥l 1 ,y t+1 ). 
Step 3 (Propagate). Sufficient statistics 

s t+1 -IL{s t , X t+1 ,V y > ,y t+ i), 

J'O -5fsW r (0 \W S(i) ,,.,.) 
■ s t+i — °\ b t > x t+i' A m' t7 iVt+i)- 



Step 4 (Propagate). Parameters 0® ~p(#|s[+i)- 

Example 2 (Dynamic factor model with time- 
varying loadings). Consider data yt = (yti, Zte)', t = 
1, . . . , T, following a dynamic factor model with time- 
varying loadings driven by a discrete latent state Xt 
with possible values {1,2}. Specifically, we have 

{yt + i\x t+ i,X t+ i,9) ~ N(/3 t+ ix t+1 , o 2 h), 
(x t +i\xt,X t+ i,9) ~N(x t ,al), 

with time- varying loadings ftt+l = (L P\ t+1 )' and ini- 
tial state distribution xq ~ N(i71q,Cq). The jumps in 
the factor loadings are driven by a Markov switch- 
ing process (Xt+i\Xt,0), whose transition matrix II 
has diagonal elements Pr(A(+i = l|Aj = 1,0) = p and 
Pr(Af+i = 2|At = 2,9) = q. The parameters are 9 = 
{Pi, fa,® 2 ,t 2 ,p,q)' ■ See Carvalho and Lopes (2007) 
for related Markov switching models. 

We are able to marginalize over both (xt+i, Xt+i) 
by using state sufficient statistics sf = (m t ,Ct) as 
particles. From the Kalman filter recursions we know 
that p(xt\X l , 9, y l ) ~ N(m t , Ct). The mapping for state 
sufficient statistics (nit+i,Ct+i) = IC(m t ,Ct, Xt+i,9, 
yt+i) is given by the one-step Kalman update as in 
(4.3) and (4.4). The prior distributions are condi- 
tionally conjugate where (/% \a 2 )~ N(b l0 ,a 2 B i0 ) for 
i = 1,2, a 2 ~ IG(iW2,doo/2) and r 2 ~ IG(v 10 /2, 
dio/2). For the transition probabilities, we assume 
that p ~ Beta(pi,p2) arid q ~ Beta(gi, (fe)- Assume 
that, at time t, we have particles {(xt,9, Xt, sf ,st)^}fL 
for i = I, . . . , N , approximating p(xt,9,Xt,St,st\y t )- 
The PL algorithm can be described through the fol- 
lowing steps: 

1. Resampling: Draw an index k % ~ MvL\t(w^\ . . . , 
w\ N >) with weights cx p(y t +i \(m t , C t , X t , 
Q)( k ')) where 

p{yt+i\st,X t ,9) 

2 

= ^2 fN{yt+i;a,b)p(X t+ i\X t ,9), 

A t +i = l 

where fiy(x;a,b) denotes the density of the nor- 
mal distribution with mean a and variance b and 
evaluation at the point x. Here a = f3t+iirit and 
b={Ct + T 2 )Pt +1 p' t+1 + a 2 I 2 . 

2. Propagating state A: Draw A® 2 from p(Xt+\ \(sf , 
Xt,9)( k '\y t+1 ): 

p(k+i\st,X t ,9,y t+ i) 

oc f N (yt+r, Pt+imt, (Ct + T 2 )/3 t+1 /3' t+1 + a 2 I 2 ) 

■p(X t+ i\X t ,9). 
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3. Propagating state x: Draw xfh from p(xt+i\X£Li, 
(sf,9)^\y t+1 ). 

4. Propagating sufficient statistics for states: The 
Kalman filter recursions yield 



m t+1 =m t + A t+1 (y t+1 - /3 t+1 m t ) 



C t +i = C t + T 2 — A t +iQ t+1 A' t 



where Q t+1 = (C t + T 2 )p t+1 /3 t+ i + a 2 I 2 and 
A t+1 = (C t + T 2 )Q-^p t+1 . 
5. Propagating sufficient statistics for parameters: 
The conditional posterior p(0\st), for i = 1,2, is 
decomposed into 

p(/3i\a 2 , s t+ i) ~ N(b itt+1 ,o- 2 Bij +1 ), 
p(o- 2 \s t+1 )~IG{v ,t+i/2,d ,t+i/2t), 
p(T 2 \s t+1 )~IG{v w /2,d w /2), 
p(p\s t+ i) ~ Beta(pi ; t + i,p 2 ,t+i), 
p(q\s t+ i) ~ Beta(gi,i + i,52 J t+i), 

with B i,t+i = B it l + ^?+ilA t+1 =i, &i,t+i = B i>t+ i • 
{B~ t l b it + x t ?/i2lA t+ i=i) and = z^t + 1, for 

£ = 1,2, di )t+ i = dit + - x t ) 2 , pi )t+ i =pit + 
lA ( =l,A t+ i=l! P2,t+1 = P2t + lA ( =l,A t+ i=2, qi,t+l = 
git + lA t =2,A t+ i=2 <?2,t+l = <?2t + lA t =2,A t+ i=l and 
<2o,t+l = rf 0t + J2 2 j=i[(yt+1,2 ~ bjj+lX t+ l)yt+l,2 + 

h i,t+\BjQ + - Xi+i) 2 ]I At+1=i . 

Figures 1 and 2 illustrate the performance of the 
PL algorithm. The first panel of Figure 1 displays 
the true underlying A process along with filtered 
and smoothed estimates, whereas the second panel 
presents the same information for the common fac- 
tor. Figure 2 provides the sequential parameter learn- 
ing plots. 

Algorithm 2 (Auxiliary state CDLM). Let z t = 

Step (Propagate). A^ to via A^ x ~p(At+i| 
(A t ,0)«). 

Step 1 (Resample). zjp from zf' with weights 

Step 2 (Propagate), to x£L x via p{xt+\\zf > , 

yt+i)- 

Step 3 (Propagate). Sufficient statistics as in PL. 
Step ^ (Propagate). Parameters as in PL. 



5. NONLINEAR FILTERING AND LEARNING 

We now extend our PL filter to a general class 
of nonlinear state space models, namely, the condi- 
tional Gaussian dynamic model (CGDM). This class 
generalizes conditional dynamic linear models by al- 
lowing nonlinear evolution equations. In this context 
we take advantage of most efficiency gains of PL, as 
we are still able to follow the resample/propagate 
logic and filter sufficient statistics for 9. Consider a 
conditional Gaussian state space model with nonlin- 
ear evolution equation, 

(5.1) (y t+ i\x t+ i,X t+1 ,e) ~ N(F Xt+1 x t+ i,V\ t+1 ), 

(5.2) (x t+ i \x t ,\ t+1 ,6) ~ N(G Xt+1 h(x t ),W Xt+1 ), 

where h(-) is a given nonlinear function and, again, 
9 contains F's, G"s, Vs and W's. Due to the non- 
linearity in the evolution, we are no longer able to 
work with state sufficient statistics sf, but we are 
still able to evaluate the predictive p(yt+i\xt, \t,9). 
In general, take as the particle set the following: 
{(x t ,9,X u st) ( - i \i = l,...,N}. For discrete A we can 
define the following algorithm: 

Algorithm 3 (CGDM). 

Step 1 (Resample). zf* from zf 1 = (xt,Xt,9)^ 
with weights 



w? <xp(y t+l \{x u X u 9)^) 
Step 2 (Propagate). States 



.p(A m |(A^)«,y m ), 

xfl^pixt^KxJpAfl^yt+i). 

Step 3 (Propagate). Parameter sufficient statistics 
as in Algorithm 1. 

Step 4 (Propagate). Parameters as in PL. 

When A is continuous, propagate xfh from p(Xt+i \ 
(At, 6)^'), for i = 1, . . . , N, then we resample the par- 
ticle (xt, At+i, 0, st)W with the appropriate predic- 
tive distribution p(y t +i\(x t , Xt+1,9)^) as in Algo- 
rithm 2. Finally, it is straightforward to extend the 
backward smoothing strategy of Section 3.3 to ob- 
tain samples from p{x T \y T ). 

Example 3 (Heavy-tailed nonlinear state space 
model). Consider the following non-Gaussian and 
nonlinear state space model 

(y t+1 \x t+ i, At +1 ,0) ~ N(x t+ i, Xt+icr 2 ), 

(x t+ i\x t ,Xt+i,d) ~ N((3h(x t ),al), 
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T 1 1 1 1 1 1 - 

50 100 150 200 250 300 

State x(t) 




Fig. 1. Dynamic factor model (state learning). Top panel: True value of At (red line), Pr(At = X | S/' ) (black line) and 
Pr(A t = l\y T ) (blue line). Bottom panel: True value of Xt (red line), E(xt\y t ) (black line) and E(xt\y T ) (blue line). 



where = (/3,a 2 ,T 2 ), h(x t ) = x t /{l + x 2 ) and X t +i ~ 
IG(z//2, z//2), for known v. Therefore, the distribu- 
tion of (y t+ i\x t +i,0) ~ t v (x t +i,(T 2 ), that is, a 
t-Student with v degrees of freedom. 

The particle learning algorithm works as follows. 
Let the particle set {(xt,9,Xt+i,st)^}fL 1 approxi- 
mate p(xt,9, Xt+i, st|y*). For anygiven time t = 0, . . . , 
T — 1 and i = 1, . . . , N , we first draw an index k l ~ 

Mult(i4 )> with w[ j) oc p(y t+ i\(x t , X t +i, 

3 = and p(y t+ i\x t , X t+ i,6) = 

/jv(yt+i; /3h(xt), Xt+ia 2 + r 2 ). Then, we draw anew 
state x^ ^ p(x t+ i\(Xt + i,xt,6) ( - kl \y t+ i) = f N (x t+ i; 

Mt+i,^), where /i m = I4+i(A^ 1 <7~ 2 yt+i + t~ 2 ■ 
Ph(x t )) and ^ = X^a' 2 + r~ 2 . Finally, simi- 
lar to Example 1, posterior parameter learning for 
6 = ((3,a 2 ,T 2 ) follows directly from a conditionally 
normal-inverse gamma update. Figure 3 illustrates 
the above PL algorithm in a simulated example where 
P = 0.9, a 2 = 0.04 and a 2 = 0.01. The algorithm un- 



covers the true parameters very efficiently in a se- 
quential fashion. In Section 6.1 we revisit this ex- 
ample to compare the performances of PL, MCMC 
(Carlin, Poison and Stoffer, 1992) and the bench- 
mark particle filter with parameter learning (Liu 
and West, 2001). 

6. COMPARING PARTICLE LEARNING TO 
EXISTING METHODS 

We now present a series of examples that illustrate 
the performance of PL benchmarked by commonly 
used alternatives. 

Example 4 (State sufficient statistics). In this 
first simulation exercise we revisit the local level 
model of Example 1 in order to compare PL to 
its version that takes advantage of state sufficient 
statistics, that is, by marginalizing the latent states. 
The main goal is to study the Monte Carlo error of 
the two filters. We simulated a time series of length 
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Fig. 2. Dynamic factor model (parameter learning). Sequential posterior median (black line) and posterior 95% credibility 
intervals (blue lines) for model parameters fj\, 02, cr 2 , r 2 , p and q. True values are the red lines. 



T = 100 with a 2 = 1, r 2 = 0.1 and x = p . The prior 
distributions are a 2 ~ IG(5,4), t 2 ~ IG(5, 0.4) and 
xq ~ iV(0, 10). We run two filters: one with sequen- 



tial learning for xt , a 2 and r 2 (we call it simply PL) , 
and the other with sequential learning for state suf- 
ficient statistics, a 2 and r 2 (we call it PLsuff). In 
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both cases, the particle filters are based on either 
one long particle set of size N = 100,000 (we call 
it Long) or 20 short particle sets of size N = 5000 
(we call it Short). The results are in Figures 4 to 
6. Figure 4 shows that the differences between PL 
and PLsuff dissipate for fairly large N. However, 
when N is small PLsuff has smaller Monte Carlo 
error and is less biased than PL, particularly when 
estimating a 2 and r 2 (see Figure 5). Similar findings 
appear in Figure 6 where the mean square errors of 
the quantiles from the 20 Short runs are compared 
to those from the Long PLsuff run. 

Example 5 (Resample-propagate or propagate- 
resample?). In this second simulation exercise we 
continue focusing in the local level model of Ex- 
ample 1 to compare PL to three other particle fil- 
ters: the bootstrap filter (BF), its fully adapted ver- 



sion (FABF), and the auxiliary particle filter (APF) 
(no fully adapted). BF and FABF are propagate- 
resample filters, while PL and APF are resample- 
propagate filters. The main goal is to study the Monte 
Carlo error of the four filters. We start with the pure 
case scenario, that is, with fixed parameters. We 
simulated 20 time series of length T = 100 from the 
local level model with parameters r 2 = 0.013, a 2 = 
0.13 and xo = 0. Therefore, the signal to noise ra- 
tio o x ja equals 0.32. Other combinations were also 
tried and similar results were found. The prior dis- 
tribution of the initial state xq was set at iV(0, 10). 
For each time series, we run 20 times on each of 
the four filters, all based on N = 1000 particles. 
We use five quantiles to compare the various fil- 
ters. Let g* be such that Pr(x t < qf \y l ) = a, for a = 
(0.05, 0.25, 0.5, 0.75, 0.95). Then, the mean square er- 



beta tau2 




t 1 1 1 1 1 f 1 1 1 r~ 

100 200 300 400 500 100 200 300 400 500 



sigma2 




100 200 300 400 500 -0.4 -0.2 0.0 0.2 4 

True State 

Fig. 3. Heavy-tailed non-Gaussian, nonlinear model. Sequential posterior median and posterior 95% credibility intervals 
(black lines) for model parameters /3, a 2 and t 2 . True values are the red lines. The bottom right panel is the true value of Xt 
against E(xt\y t ) ■ 
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ror (MSE) for filter /, at time t and quantile a is 

1 20 20 

d=l r=l 

where d and r index the data set and the particle 
filter run, respectively. We compare PL, APF and 
FABF via logarithm relative MSE (LRMSE), rela- 
tive to the benchmark BF. Results are summarized 
in Figure 7. PL is uniformly better than all three 
alternatives. Notice that the only algorithmic differ- 
ence between PL and FABF is that PL reverses the 
propagate-resample steps. 

We now move to the parameter learning scenario, 
where a 2 is still kept fixed but learning of r 2 is per- 
formed. Three time series of length T = 1000 were 
simulated from the local level model with xq = and 
(cr 2 ,r 2 ) in {(0.1, 0.01), (0.01, 0.01), (0.01, 0.1)}. The 
independent prior distributions for xq and r 2 are 
x ~ iV(0,l) and r 2 ~ IG(10,9r 2 ), where r 2 is the 
true value of r 2 for a given time series. In all filters 
t 2 is sampled offline from p(T 2 \st) where st is the 
vector of conditional sufficient statistics. We run the 
filters 100 times, all with the same seed within run, 
for each one of the three simulated data sets. Finally, 
the number of particles was set at N = 5000, with 
similar results found for smaller N, ranging from 250 
to 2000 particles. Mean absolute errors (MAE) over 
the 100 replications are constructed by comparing 
quantiles of the true sequential distributions p{xt\y ) 
andp(r 2 |y*) to quantiles of the estimated sequential 



distributions p N (xt\y t ) and p N (T 2 \y t ). More specifi- 
cally, for time t, a in {x,t 2 }, a in {0.01,0.50,0.99}, 
true quantiles qf a and PL quantiles qf ar , 

100 

MAE- a = m J2Ka-Q?,a,rl 
r=l 

Across different quantiles and combinations of er- 
ror variances, PL is at least as good as FABF and 
in many cases significantly better than BF. Results 
appear in Figure 8. 

Example 6 (PL versus LW) . Consider once again 
a variation of the dynamic linear model introduced 
in Example 1, but now we assume complete knowl- 
edge about (a 2 ,r 2 ) in 

(yt+i\xt+i,P) ~ N(x t ,a 2 ), 

(x t+1 \x t ,f3)~N{(3xt,T 2 ) 

for t = l,...,T = 100, a 2 = 1, xx = 0.0 and three 
possible values for r 2 = (0.01, 0.25, 1.00). So, the sig- 
nal to noise ratio r/cr = 0.1, 0.5, 1.0. Only f3 and 
xt are sequentially estimated and their independent 
prior distributions are 7V(1. 0,1.0) and iV(0.0, 1.0), 
respectively. The particle set has length N = 2000 
and both filters were run 50 times to study the size 
of the Monte Carlo error. The smoothing parameter 
5 of Liu and West's filter was set at 6 = 0.95, but 
fairly similar results were found for 5 ranging from 
0.8 to 0.99. Our findings, summarized in Figure 9, fa- 
vor PL over LW uniformly across all scenarios. The 
discrepancy is higher when r/cr is small, which is 
usually the case in state space applications. 



State sig2 tau2 
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20 40 60 80 100 20 40 60 80 100 20 40 60 80 100 



Fig. 4. PL and PL with state sufficient statistics (long runs). Left panel — p(xt\y t ) — PL (black), PLsuff (red); Middle 
panel — p(o- 2 |y*) — PL (solid line), PLsuff (dotted line); Right panel — p(t 2 |2/ 4 ) — PL (solid line), PLsuff (dotted line). 
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Fig. 5. PL and PL with state sufficient statistics (20 short runs). PL runs (left columns) and PLsuff runs (right columns). 
One long run (black) and 20 short runs (gray); p(xt\y t ) (top row), p(a 2 \y t ) (middle row) anrfp(r 2 |y*) (bottom row). 



6.1 PL vs MCMC 

PL combined with the backward smoothing algo- 
rithm (as in Section 3.3) is an alternative to MCMC 
methods for state space models. In general, MCMC 
methods (see Gamerman and Lopes, 2006) use 
Markov chains designed to explore the posterior dis- 
tribution p(x T ,9\y T ) of states and parameters condi- 



Dt)- F° r example, an MCMC strategy would have 
to iterate through 

p(6\x T ,y T ) and p(x T \9,y T ). 

However, MCMC relies on the convergence of very 
high-dimensional Markov chains. In the purely con- 
ditional Gaussian linear models or when states are 



tional on all the information available, y = (y±, . . . , dicrete, p(x \9,y ) can be sampled in block using 
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Fig. 6. PL and PL with state sufficient statistics (mean square errors). Logarithm of the relative mean square error for three 



quantiles of p N {xt\y t ), p N (a 2 \y t ) and p N (t 2 ^) , 



across the 20 N = 5000 runs. PL relative to PLsuff. 



FFBS. Even in these ideal cases, achieving conver- 
gency is far from an easy task and the computa- 
tional complexity is enormous, as at each iteration 
one would have to filter forward and backward sam- 
ple for the full state vector x T . The particle learning 
algorithm presented here has two advantages: (i) it 
requires only one forward/backward pass through 
the data for all N particles and (ii) the approxima- 
tion accuracy does not rely on convergence results 
that are virtually impossible to assess in practice 
(see Papaspiliopoulos and Roberts, 2008). 



In the presence of nonlinearities, MCMC methods 
will suffer even further, as no FFBS scheme is avail- 
able for the full state vector x T . One would have to 
resort to univariate updates of p(xt\x/ t \, 9, y T ) as in 
Carlin, Poison and Stoffer (1992), where x/ t \ is x T 
without xt . It is well known that these methods gen- 
erate very "sticky" Markov chains, increasing com- 
putational complexity and slowing down convergence. 
PL is also attractive given the simple nature of its 
implementation (especially if compared to more novel 
hybrid methods). 



Percentile = 5 Percentile = 25 Percentile = 50 Percentile = 75 Percentile = 95 




APF FABF PL APF FABF PL APF FABF PL APF FABF PL APF FABF PL 



Fig. 7. APF, FABF and PL pure filter. Logarithm of the relative mean square error for five quantiles of p N (xt\y f ) ■ MSE 
relative to BF. Boxplots on the second row are based on the time series plots on the first row. 
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Fig. 8. BF, FABF and PL with learning of r 2 . Mean absolute errors. BF (black), FABF (red) and PL (blue). 



Example 7 (PL versus FFBS). We revisit the 
first order dynamic linear model introduced in Ex- 
ample 1 to compare our PL smoother and the forward- 
filtering, backward-sampling (FFBS) smoother. As- 
suming knowledge about 9, Figure 10 compares the 
true smoothed distributions p(xt\y T ) to approxima- 
tions based on PL and on FFBS. Now, when param- 
eter learning is introduced, PL performance is com- 
parable to that of the FFBS when approximating 
p(a 2 ,T 2 \y T ), as shown in Figure 11. We argue that, 
based on these empirical findings, PL and FFBS 
are equivalent alternatives for posterior computa- 
tion. We now turn to the issue of computational 
cost, measured here by the running time in sec- 
onds of both schemes. Data was simulated based 
on (a 2 ,T 2 ,xo) = (1.0,0.5,0.0). The prior distribu- 
tion of xq is iV(0,100), while a 2 and r 2 are kept 



fixed throughout this exercise. PL was based on N 
particles and FFBS based on 2N iterations, with the 
first M discarded. Table 1 summarizes the results. 
For fixed N, the (computational) costs of both PL 
and FFBS increase linearly with T, with FFBS twice 
as fast as PL. For fixed T, the cost of FFBS increases 
linearly with TV, while the cost of PL increases ex- 
ponentially with N. These findings were anticipated 
in Section 3.3. As expected, PL outperforms FFBS 
when comparing filtering times. 

Example 8 (PL versus single- move MCMC). Our 
final example compares PL to a single-move MCMC 
as in Carlin, Poison and Stoffer (1992). We con- 
sider the first order conditional Gaussian dynamic 
model with nonlinear state equation as defined in 
Example 3. The example focuses on the estimation 
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Fig. 9. PL and LW (parameter learning). Posterior mean and 95% credibility interval from p(fi\y ) ■ Medians across the 50 
runs appear in red. N — 2000 particles, signal-to-noise stands for o~ x jo~. In all cases, o~ = \. 



of a 2 . We generate data with different levels of sig- the comparisons. Once again, PL provides signifi- 
nal to noise ratio and compare the performance of cant improvements in computational time and MC 
PL versus MCMC. Table 2 presents the results for variability for parameter estimation over MCMC. 
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Smoothed distributions 



Kullback Leibler divergence 





PL 



T 
MCMC 



Time 

Fig. 10. PL and FFBS (smoothed distributions) . T = 100 simulated from a local level model with a 2 — 1, r 2 = 0.5, xo — 
and xq ~ N(0, 100). PL is based on N = 1000 particles, while FFBS is based on 2N draws with the first N discarded. 
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Fig. 11. PL and FFBS (parameter learning). Contour plots for the true posterior p(a 2 ,r 2 \y T ) (red contours) and posterior 
draws from PL, panels (a) and (c), and FFBS, panels (b) and (d). The blue dots represent the true value of the pair (a 2 ,T 2 ). 
The sample size is T = 50 (top row) and T = 500 (bottom row). 
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Table 1 

Computing time (in seconds) of PL and FFBS for 
smoothing. In parenthesis are PL times for filtering 





N = 500 






T = 100 




T 


PL 


FFBS 


N 


PL 


FFBS 


200 


18.8 (0.25) 


9.1 


500 


9.3 (0.09) 


4.7 


500 


47.7 (1.81) 


23.4 


1000 


32.8 (0.15) 


9.6 


1000 


93.9 (8.29) 


46.1 


2000 


127.7 (0.34) 


21.7 



Table 2 

Single-move MCMC based on 2000 draws, after 2000 
burn-in. PL based on 2000 particles. Expectations are with 
respect to the whole data set at time T , while the true value 
of t 2 is 0.01. Numbers in parenthesis are 1000 times the 
standard deviation based on 20 replications of the 
algorithms. Time is in seconds when running our code in R 
version 2.8.1 on a MacBook with a 2.4 GHz processor and 4 
GB MHz of memory 



T 


a 2 


Time 


E(o 


2 ) 


E(t 2 ) 






Sin! 


de-move MCMC 






50 


0.2500 


19.7 


0.209934 


(3.901) 


0.011 


(1.532) 




0.0100 


19.3 


0.009151 


(0.253) 


0.008 


(0.545) 




0.0001 


19.3 


0.000097 


(0.003) 


0.010 


(0.049) 


200 


0.2500 


79.3 


0.249059 


(6.981) 


0.027 


(12.76) 




0.0100 


79.1 


0.009740 


(0.305) 


0.013 


(1.375) 




0.0001 


79.8 


0.000099 
PL 


(0.004) 


0.011 


(0.032) 


50 


0.2500 


0.8 


0.170576 


(1.633) 


0.010 


(0.419) 




0.0100 


0.7 


0.007204 


(0.151) 


0.008 


(0.165) 




0.0001 


0.0 


0.000092 


(0.004) 


0.010 


(0.058) 


200 


0.2500 


6.5 


0.262396 


(6.392) 


0.009 


(1.332) 




0.0100 


6.4 


0.010615 


(0.570) 


0.011 


(0.935) 




0.0001 


6.4 


0.000098 


(0.010) 


0.011 


(0.057) 



7. FINAL REMARKS 

In this paper we provide particle learning tools 
(PL) for a large class of state space models. Our 
methodology incorporates sequential parameter learn- 
ing, state filtering and smoothing. This provides an 
alternative to the popular FFBS/MCMC (Carter 
and Kohn, 1994) approach for conditional dynamic 
linear models (DLMs) and also to MCMC approaches 
to nonlinear non-Gaussian models. It is also a gen- 
eralization of the mixture Kalman filter (MKF) ap- 
proach of Chen and Liu (2000) that includes pa- 
rameter learning and smoothing. The key assump- 
tion is the existence of a conditional sufficient statis- 
tic structure for the parameters which is commonly 
available in many commonly used models. 



We provide extensive simulation evidence to ad- 
dress the efficiency of PL versus standard methods. 
Computational time and accuracy are used to assess 
the performance. Our approach compares very fa- 
vorably with these existing strategies and is robust 
to particle degeneracies as the sample size grows. 
Finally, PL has the additional advantage of being 
an intuitive and easy-to-implement computational 
scheme and should, therefore, become a default choice 
for posterior inference in a variety of models, with 
examples already appearing in Lopes et al. (2010), 
Carvalho et al. (2009), Prado and Lopes (2010), 
Lopes and Tsay (2010) and Lopes and Poison (2010). 
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